Introduction

In this final exam, I analyze and forecast unemployment in the State of New York using the monthly time series variable UnemployedLF from the provided dataset. Forecasting unemployment is important because it helps policymakers, businesses, and households understand the economic health of the region, anticipate labor market pressures, and plan for future conditions.

In this report, I will:

  • import the data and construct a monthly time series for UnemployedLF;
  • plot the series and examine its central tendency and distribution;
  • decompose it into trend, seasonal, and irregular components;
  • fit Naïve, Simple Moving Average, SES, Holt-Winters, and ARIMA models;
  • compare the models' accuracy and rank them.

Import Data

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(readr)
library(dplyr)
library(ggplot2)
library(forecast)
library(lubridate)
library(tseries)
library(gridExtra)
# Read the New York unemployment dataset
data <- read_csv("Data_NewYorkEdited.csv")

# Inspect the first few rows
head(data)
## # A tibble: 6 × 11
##    Code State          Year Month    NIP     LF Population Employment EmployedLF
##   <dbl> <chr>         <dbl> <dbl>  <dbl>  <dbl>      <dbl>      <dbl>      <dbl>
## 1 51000 New York city  1976     1 5.69e6 3.07e6       53.9    2723016       47.8
## 2 51000 New York city  1976     2 5.69e6 3.07e6       53.9    2722421       47.8
## 3 51000 New York city  1976     3 5.69e6 3.06e6       53.9    2722931       47.9
## 4 51000 New York city  1976     4 5.69e6 3.07e6       53.9    2726299       47.9
## 5 51000 New York city  1976     5 5.69e6 3.07e6       54      2730681       48  
## 6 51000 New York city  1976     6 5.68e6 3.08e6       54.1    2735049       48.1
## # ℹ 2 more variables: Unemployment <dbl>, UnemployedLF <dbl>
# Key columns:
# Year, Month, and UnemployedLF are used to construct the time series

# Ensure correct column names (if needed)
# colnames(data)

# Construct a monthly time series for UnemployedLF
start_year  <- min(data$Year, na.rm = TRUE)
start_month <- min(data$Month[data$Year == start_year], na.rm = TRUE)

unemp_ts <- ts(
  data$UnemployedLF,
  start     = c(start_year, start_month),
  frequency = 12
)

unemp_ts
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1976 11.2 11.2 11.2 11.1 11.1 11.1 11.1 11.1 11.1 11.1 11.0 10.9
## 1977 10.8 10.7 10.5 10.3 10.2 10.1 10.1 10.0 10.0 10.0  9.9  9.8
## 1978  9.6  9.4  9.3  9.1  9.0  8.9  8.8  8.8  8.7  8.7  8.7  8.6
## 1979  8.6  8.6  8.6  8.7  8.7  8.8  8.8  8.7  8.7  8.6  8.5  8.4
## 1980  8.3  8.3  8.4  8.4  8.5  8.5  8.5  8.5  8.5  8.5  8.6  8.7
## 1981  8.8  8.9  9.0  9.0  8.9  8.7  8.6  8.6  8.7  8.9  9.0  9.2
## 1982  9.3  9.4  9.4  9.5  9.5  9.6  9.7  9.8  9.9 10.1 10.2 10.3
## 1983 10.3 10.2 10.2 10.1 10.0 10.0  9.9  9.8  9.6  9.4  9.2  9.1
## 1984  8.9  8.8  8.7  8.7  8.8  9.0  9.1  9.2  9.1  9.0  8.9  8.8
## 1985  8.7  8.7  8.7  8.4  8.2  8.0  7.9  7.8  7.7  7.7  7.7  7.7
## 1986  7.8  7.9  7.9  7.9  7.8  7.6  7.4  7.1  6.9  6.7  6.6  6.4
## 1987  6.3  6.1  5.9  5.7  5.6  5.5  5.4  5.4  5.4  5.3  5.2  5.1
## 1988  4.9  4.8  4.7  4.6  4.7  4.9  5.1  5.2  5.4  5.5  5.6  5.7
## 1989  5.9  6.0  6.1  6.3  6.4  6.5  6.6  6.7  6.7  6.7  6.8  6.8
## 1990  6.7  6.7  6.6  6.5  6.5  6.5  6.6  6.8  7.0  7.2  7.4  7.6
## 1991  7.8  8.1  8.4  8.6  8.8  8.9  9.0  9.0  9.1  9.3  9.6  9.8
## 1992 10.1 10.3 10.6 10.8 11.1 11.3 11.4 11.5 11.5 11.5 11.5 11.4
## 1993 11.3 11.1 10.8 10.6 10.3 10.1 10.0 10.0 10.0 10.0 10.0 10.0
## 1994  9.8  9.7  9.5  9.2  9.0  8.8  8.7  8.5  8.3  8.1  7.9  7.8
## 1995  7.7  7.8  7.9  8.0  8.2  8.3  8.4  8.4  8.4  8.4  8.4  8.4
## 1996  8.4  8.5  8.6  8.7  8.7  8.7  8.7  8.7  8.8  8.9  9.2  9.4
## 1997  9.6  9.7  9.7  9.7  9.7  9.6  9.5  9.3  9.1  8.9  8.7  8.5
## 1998  8.4  8.3  8.2  8.0  7.9  7.8  7.7  7.7  7.7  7.7  7.6  7.4
## 1999  7.3  7.1  7.0  7.0  7.0  7.0  7.0  6.9  6.8  6.6  6.4  6.2
## 2000  6.1  6.0  6.0  5.9  5.9  5.8  5.8  5.7  5.6  5.5  5.4  5.3
## 2001  5.3  5.3  5.3  5.4  5.5  5.6  5.9  6.2  6.6  7.0  7.3  7.6
## 2002  7.8  7.9  8.0  8.1  8.0  8.0  7.9  7.9  8.0  8.1  8.2  8.3
## 2003  8.4  8.4  8.4  8.4  8.4  8.4  8.4  8.3  8.2  8.1  8.0  7.9
## 2004  7.8  7.7  7.6  7.4  7.3  7.1  6.9  6.7  6.5  6.3  6.1  6.0
## 2005  5.9  5.8  5.7  5.6  5.6  5.5  5.6  5.6  5.6  5.6  5.6  5.5
## 2006  5.4  5.3  5.2  5.1  5.1  5.0  4.9  4.7  4.7  4.6  4.5  4.5
## 2007  4.5  4.6  4.7  4.7  4.8  4.9  5.0  5.0  5.0  5.0  5.0  5.0
## 2008  5.0  5.0  5.0  5.0  5.1  5.3  5.4  5.6  5.9  6.2  6.6  7.0
## 2009  7.5  7.9  8.5  8.9  9.2  9.4  9.6  9.8 10.0 10.1 10.1 10.1
## 2010 10.1 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.4  9.3  9.2  9.1
## 2011  9.0  9.0  8.9  8.9  8.9  9.0  9.1  9.2  9.3  9.4  9.4  9.5
## 2012  9.5  9.6  9.6  9.6  9.6  9.6  9.5  9.4  9.3  9.2  9.2  9.2
## 2013  9.1  9.1  9.0  9.0  8.9  8.9  8.9  8.9  8.7  8.6  8.4  8.2
## 2014  8.0  7.8  7.6  7.5  7.3  7.1  6.9  6.8  6.6  6.5  6.4  6.3
## 2015  6.2  6.1  6.0  5.9  5.7  5.6  5.4  5.2  5.2  5.1  5.2  5.2
## 2016  5.2  5.2  5.2  5.1  5.1  5.2  5.2  5.3  5.2  5.2  5.0  4.9
## 2017  4.7  4.6  4.5  4.5  4.5  4.6  4.6  4.6  4.5  4.5  4.4  4.3
## 2018  4.3  4.2  4.2  4.1  4.1  4.0  4.0  4.0  4.1  4.2  4.3  4.3
## 2019  4.3  4.2  4.1  4.0  3.9  3.8  3.8  3.8  3.8  3.8  3.9  4.0
## 2020   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 2021   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 2022   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA

The time series unemp_ts records the monthly unemployment rate (UnemployedLF, expressed in percent of the labor force) for New York from January 1976 onward. The values for 2020 through 2022 are missing (NA), so the observed history effectively ends in December 2019.


Plot and Inference

autoplot(unemp_ts) +
  labs(
    title = "New York Unemployment (UnemployedLF) – Monthly Time Series",
    x = "Year",
    y = "UnemployedLF"
  ) +
  theme_minimal()

Observations from the time series plot:

  • The series is dominated by long business-cycle swings rather than a steady trend, with peaks around 11.5 in 1992 and about 10.1 in 2009–2010.
  • Spikes upward are sharp, while the subsequent recoveries are gradual; unemployment reaches its lowest observed level (3.8) in 2019.
  • Seasonal movement is visually small relative to these cyclical swings.
  • The observed history ends in December 2019, with 2020–2022 recorded as NA.


Central Tendency

# Summary statistics
summary(unemp_ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.800   5.800   8.100   7.657   9.100  11.500      36
# Boxplot
boxplot(
  unemp_ts,
  main = "Boxplot of New York Unemployment (UnemployedLF)",
  ylab = "UnemployedLF"
)

Observations: The boxplot shows the typical range of unemployment levels with some extreme values during recessionary periods. The distribution is not perfectly symmetric; high unemployment episodes create a longer upper tail, indicating that the series is somewhat right-skewed. This reflects the fact that unemployment can spike sharply upward during crises but usually declines more slowly during recoveries.


Decomposition

To better understand the structure of the series, I decompose it into trend, seasonal, and irregular components. Because the magnitude of seasonal fluctuations appears relatively stable across different levels of unemployment, an additive decomposition is generally appropriate. However, I first inspect both additive and multiplicative versions.

# Additive decomposition
decomp_add <- decompose(unemp_ts, type = "additive")

# Multiplicative decomposition
decomp_mult <- decompose(unemp_ts, type = "multiplicative")

# Note: plot() on a "decomposed.ts" object draws its own four-panel layout,
# so par(mfrow = c(2, 1)) would be ignored; each decomposition is therefore
# plotted on its own page (the default panel titles identify the type).
plot(decomp_add)

plot(decomp_mult)

Visually, the additive decomposition is more appropriate for an unemployment series because the amplitude of seasonal swings does not grow proportionally with the level of the series. Therefore, I proceed with the additive decomposition.

decomp <- decomp_add

# Seasonal component and indices
seasonal_vals <- decomp$seasonal
month <- cycle(unemp_ts)

# Average seasonal index per month
seasonal_index <- tapply(seasonal_vals, month, mean)
seasonal_index
##             1             2             3             4             5 
##  0.0067829457  0.0069767442  0.0048449612 -0.0135658915 -0.0135658915 
##             6             7             8             9            10 
## -0.0139534884 -0.0040697674 -0.0068798450 -0.0001937984  0.0089147287 
##            11            12 
##  0.0134689922  0.0112403101

In the additive framework:

The month with the highest average seasonal index is month 11 (November), and the month with the lowest is month 6 (June). The indices themselves are tiny, on the order of ±0.01 percentage point, so seasonality is weak relative to the cyclical swings. What pattern exists may reflect seasonal hiring, school schedules, holiday periods, and weather-related factors, all of which can influence labor demand and job search behavior.
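Since tapply() keys the indices by month number, a small sketch using base R's built-in month.abb vector makes the peak and trough months easier to read:

```r
# Attach month abbreviations to the seasonal indices and rank them.
names(seasonal_index) <- month.abb
sort(seasonal_index, decreasing = TRUE)
```

The sorted output puts November and December at the top and June at the bottom, matching the discussion above.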

Seasonally Adjusted Series

# Seasonally adjusted unemployment (additive)
adj_ts <- unemp_ts - seasonal_vals

autoplot(unemp_ts, series = "Original") +
  autolayer(adj_ts,  series = "Seasonally Adjusted") +
  labs(
    title = "Original vs Seasonally Adjusted Unemployment (New York)",
    x = "Year",
    y = "UnemployedLF"
  ) +
  guides(colour = guide_legend(title = "Series")) +
  theme_minimal()

Adjusting for seasonality removes regular intra-year fluctuations and makes it easier to see underlying cyclical behavior and turning points in the labor market. The seasonally adjusted series emphasizes the impact of recessions and expansions without the recurring monthly pattern.


Naïve Method

The Naïve method assumes that the most recent observed value is the best forecast for all future periods. It provides a simple benchmark.

fit_naive <- naive(unemp_ts, h = 12)

autoplot(fit_naive) +
  labs(
    title = "Naïve Forecast for New York Unemployment (Next 12 Months)",
    x = "Year",
    y = "UnemployedLF"
  ) +
  theme_minimal()

Residual Analysis – Naïve Method

res_naive <- residuals(fit_naive)

# 1. Residuals over time
p1 <- autoplot(res_naive) +
  labs(title = "Naïve Residuals over Time",
       x = "Year", y = "Residuals") +
  theme_minimal()

# 2. Histogram of residuals
p2 <- ggplot(data.frame(res_naive), aes(x = res_naive)) +
  geom_histogram(bins = 20) +
  labs(title = "Histogram of Naïve Residuals",
       x = "Residuals", y = "Count") +
  theme_minimal()

# 3. Fitted vs Residuals
p3 <- ggplot(data.frame(fitted = fitted(fit_naive), res = res_naive),
             aes(x = fitted, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  labs(title = "Fitted Values vs Residuals (Naïve)",
       x = "Fitted values", y = "Residuals") +
  theme_minimal()

# 4. Actual vs Residuals
p4 <- ggplot(data.frame(actual = as.numeric(unemp_ts), res = res_naive),
             aes(x = actual, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  labs(title = "Actual Values vs Residuals (Naïve)",
       x = "Actual values", y = "Residuals") +
  theme_minimal()

grid.arrange(p1, p2, p3, p4, ncol = 2)

# 5. ACF of residuals
ggAcf(res_naive) +
  ggtitle("ACF of Naïve Residuals") +
  theme_minimal()

The residual plots typically show that the Naïve model leaves substantial autocorrelation and structure in the errors. This suggests that more sophisticated models should achieve better forecasting performance.

Accuracy and Forecast – Naïve Method

acc_naive <- accuracy(fit_naive)
acc_naive
##                       ME     RMSE       MAE        MPE     MAPE      MASE
## Training set -0.01366224 0.135389 0.1013283 -0.2129164 1.390287 0.1018414
##                   ACF1
## Training set 0.8262271
# Forecast table for the next 12 months
naive_forecast_table <- data.frame(
  Time      = time(fit_naive$mean),
  Forecast  = as.numeric(fit_naive$mean)
)
naive_forecast_table
##        Time Forecast
## 1  2023.000        4
## 2  2023.083        4
## 3  2023.167        4
## 4  2023.250        4
## 5  2023.333        4
## 6  2023.417        4
## 7  2023.500        4
## 8  2023.583        4
## 9  2023.667        4
## 10 2023.750        4
## 11 2023.833        4
## 12 2023.917        4

Although simple, the Naïve method carries no information about trend, seasonality, or the business cycle: it merely extends the last observed value, so its forecasts degrade quickly once labor-market conditions turn.
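One step up from the plain Naïve benchmark is the seasonal naïve method (snaive() in the forecast package), which forecasts each month with its value from one year earlier. A minimal sketch, reusing the unemp_ts object built above:

```r
# Seasonal naive benchmark: forecast each month by its value 12 months ago.
fit_snaive <- snaive(unemp_ts, h = 12)
accuracy(fit_snaive)
```

Given how weak the seasonal indices are for this series, snaive would not be expected to beat the plain Naïve method here, but it is a useful extra benchmark for strongly seasonal data.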


Simple Moving Averages

Simple Moving Averages (SMA) smooth the data over fixed windows. I compare orders 3, 6, and 9.

ma3 <- ma(unemp_ts, order = 3)
ma6 <- ma(unemp_ts, order = 6)
ma9 <- ma(unemp_ts, order = 9)

autoplot(unemp_ts, series = "Original") +
  autolayer(ma3, series = "MA(3)") +
  autolayer(ma6, series = "MA(6)") +
  autolayer(ma9, series = "MA(9)") +
  labs(
    title = "Simple Moving Averages of New York Unemployment",
    x = "Year",
    y = "UnemployedLF"
  ) +
  guides(colour = guide_legend(title = "Series")) +
  theme_minimal()

Bonus

# Compute MA(6)
ma6 <- ma(unemp_ts, order = 6)

# The last available MA value is used as the forecast for all future periods
last_ma6 <- tail(na.omit(ma6), 1)

# Create 12-month SMA forecast
sma_forecast <- ts(rep(last_ma6, 12),
                   start = c(end(time(unemp_ts))[1],
                             end(time(unemp_ts))[2] + 1),
                   frequency = 12)

# Plot original series + SMA forecast
autoplot(unemp_ts, series = "Original") +
  autolayer(ma6, series = "MA(6)") +
  autolayer(sma_forecast, series = "MA(6) Forecast (Next 12 Months)") +
  labs(title = "Bonus: 12-Month Forecast Using Simple Moving Average (MA6)",
       x = "Year", y = "UnemployedLF") +
  guides(colour = guide_legend(title = "Series")) +
  theme_minimal()

# Print forecast values
sma_forecast
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2023 3.833333 3.833333 3.833333 3.833333 3.833333 3.833333 3.833333 3.833333
##           Sep      Oct      Nov      Dec
## 2023 3.833333 3.833333 3.833333 3.833333

Observations:

  • Larger windows give smoother curves: MA(9) filters out most month-to-month noise but lags turning points, while MA(3) follows the series closely.
  • Because an SMA forecast simply carries the last smoothed value forward, the bonus 12-month forecast is a flat line at about 3.83, ignoring any trend or seasonality.
  • In practice, SMAs are most useful for smoothing and visualization rather than for long-horizon forecasting.


Simple Smoothing (Simple Exponential Smoothing, SES)

Simple Exponential Smoothing (SES) models the level of the series using exponentially weighted averages of past values, which is appropriate mainly for series without strong trend or seasonality. For unemployment, SES is still a useful benchmark.

fit_ses <- ses(unemp_ts, h = 12)

autoplot(fit_ses) +
  labs(
    title = "Simple Exponential Smoothing – Forecast (Next 12 Months)",
    x = "Year",
    y = "UnemployedLF"
  ) +
  theme_minimal()

SES Parameters

fit_ses$model$par
##      alpha          l 
##  0.9998996 11.2000757
sigma_ses <- sqrt(fit_ses$model$sigma2)
sigma_ses
## [1] 0.1355289
  • Alpha: controls how heavily recent data are weighted relative to older observations. Here alpha ≈ 0.9999, so the level essentially tracks the most recent observation and SES behaves almost exactly like the Naïve method.
  • Initial state (l): the estimated starting level of the series, about 11.2, matching the first observations in 1976.
  • Sigma: the standard deviation of the residuals, about 0.136; lower sigma indicates a tighter fit.
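To make the role of alpha concrete, here is a hand-rolled sketch of the SES level recursion l_t = alpha * y_t + (1 - alpha) * l_{t-1} (the function name and toy values are illustrative, not from the dataset):

```r
# One-step SES forecast computed by iterating the level recursion.
ses_by_hand <- function(y, alpha, l0) {
  level <- l0
  for (obs in y) level <- alpha * obs + (1 - alpha) * level
  level  # forecast for the next period
}

y_toy <- c(5.0, 5.2, 5.1, 4.9)
ses_by_hand(y_toy, alpha = 0.999, l0 = 5.0)  # ~4.90: tracks the last value
ses_by_hand(y_toy, alpha = 0.2,   l0 = 5.0)  # ~5.02: a smoother blend of history
```

With alpha close to 1, the recursion simply reproduces the last observation, which is exactly the Naïve forecast.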

Residual Analysis – SES

res_ses <- residuals(fit_ses)

p1_ses <- autoplot(res_ses) +
  labs(title = "SES Residuals over Time",
       x = "Year", y = "Residuals") +
  theme_minimal()

p2_ses <- ggplot(data.frame(res_ses), aes(x = res_ses)) +
  geom_histogram(bins = 20) +
  labs(title = "Histogram of SES Residuals",
       x = "Residuals", y = "Count") +
  theme_minimal()

p3_ses <- ggplot(data.frame(fitted = fitted(fit_ses), res = res_ses),
                 aes(x = fitted, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  labs(title = "Fitted Values vs Residuals (SES)",
       x = "Fitted values", y = "Residuals") +
  theme_minimal()

# Align actual values with available SES residuals
actual_aligned <- tail(as.numeric(unemp_ts), length(res_ses))

p4_ses <- ggplot(
  data.frame(actual = actual_aligned, res = res_ses),
  aes(x = actual, y = res)
) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  labs(
    title = "Actual Values vs Residuals (SES)",
    x = "Actual values", y = "Residuals"
  ) +
  theme_minimal()

grid.arrange(p1_ses, p2_ses, p3_ses, p4_ses, ncol = 2)

ggAcf(res_ses) +
  ggtitle("ACF of SES Residuals") +
  theme_minimal()

Accuracy – SES

acc_ses <- accuracy(fit_ses)
acc_ses
##                      ME      RMSE       MAE        MPE    MAPE      MASE
## Training set -0.0136379 0.1352719 0.1011464 -0.2125391 1.38779 0.1016586
##                   ACF1
## Training set 0.8262714

SES may improve upon the Naïve method but still ignores explicit seasonality and more complex dynamics present in unemployment.


Holt-Winters Method

The Holt-Winters method extends exponential smoothing to include both trend and seasonality. Given the observed seasonal pattern in unemployment, Holt-Winters is a natural modeling choice.

Here I use additive seasonality, consistent with the decomposition.

fit_hw <- hw(unemp_ts, seasonal = "additive", h = 12)

autoplot(fit_hw) +
  labs(
    title = "Holt-Winters Forecast (Additive Seasonality, Next 12 Months)",
    x = "Year",
    y = "UnemployedLF"
  ) +
  theme_minimal()

Holt-Winters Parameters

fit_hw$model$par
##         alpha          beta         gamma             l             b 
##  0.6488184799  0.1715057902  0.0533961560 11.1657225568 -0.0665392804 
##            s0            s1            s2            s3            s4 
##  0.0168251980  0.0820270451 -0.1089168744 -0.0005134583  0.0103756515 
##            s5            s6            s7            s8            s9 
## -0.0065451718 -0.0492780143 -0.0674140794  0.0626972225  0.1164012542 
##           s10 
## -0.0338067088
sigma_hw <- sqrt(fit_hw$model$sigma2)
sigma_hw
## [1] 0.1385577

Residual Analysis – Holt-Winters

res_hw <- residuals(fit_hw)

p1_hw <- autoplot(res_hw) +
  labs(title = "Holt-Winters Residuals over Time",
       x = "Year", y = "Residuals") +
  theme_minimal()

p2_hw <- ggplot(data.frame(res_hw), aes(x = res_hw)) +
  geom_histogram(bins = 20) +
  labs(title = "Histogram of Holt-Winters Residuals",
       x = "Residuals", y = "Count") +
  theme_minimal()

p3_hw <- ggplot(data.frame(fitted = fitted(fit_hw), res = res_hw),
                aes(x = fitted, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  labs(title = "Fitted Values vs Residuals (Holt-Winters)",
       x = "Fitted values", y = "Residuals") +
  theme_minimal()

# Align actual values with Holt-Winters residuals
actual_hw <- tail(as.numeric(unemp_ts), length(res_hw))

p4_hw <- ggplot(data.frame(actual = actual_hw, res = res_hw),
                aes(x = actual, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  labs(title = "Actual Values vs Residuals (Holt-Winters)",
       x = "Actual values", y = "Residuals") +
  theme_minimal()

grid.arrange(p1_hw, p2_hw, p3_hw, p4_hw, ncol = 2)

ggAcf(res_hw) +
  ggtitle("ACF of Holt-Winters Residuals") +
  theme_minimal()

Accuracy – Holt-Winters

acc_hw <- accuracy(fit_hw)
acc_hw
##                       ME      RMSE       MAE       MPE    MAPE      MASE
## Training set 0.001062708 0.1364422 0.1099957 0.1269126 1.51283 0.1105528
##                   ACF1
## Training set 0.7375743

In principle, Holt-Winters should improve on Naïve and SES by modeling trend and seasonality explicitly. Here, however, its training errors are essentially the same as SES (and slightly worse on MAE and MAPE), which is consistent with the very small seasonal indices found in the decomposition: there is little seasonality for the model to exploit.


ARIMA / Box–Jenkins

In this section, I use the Box–Jenkins methodology to build an ARIMA model for New York unemployment.

Stationarity and Differencing

# Augmented Dickey–Fuller test on the original series
# Remove NA values before ADF test
unemp_ts_clean <- na.omit(unemp_ts)

# Augmented Dickey–Fuller test on cleaned series
adf_test_original <- adf.test(unemp_ts_clean)
adf_test_original
## 
##  Augmented Dickey-Fuller Test
## 
## data:  unemp_ts_clean
## Dickey-Fuller = -3.7447, Lag order = 8, p-value = 0.02177
## alternative hypothesis: stationary

The ADF test gives p ≈ 0.022 < 0.05, so we reject the null hypothesis of a unit root and conclude the series is stationary at the 5% level. The evidence is not clear-cut, though: ndiffs() below still recommends one non-seasonal difference, so I difference once before identifying ARIMA orders.
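As a complementary check (a sketch; kpss.test() comes from the tseries package already loaded above, and its null hypothesis is stationarity, the reverse of the ADF null):

```r
# KPSS test: failing to reject here would corroborate the ADF conclusion,
# while a rejection would support differencing the series.
kpss.test(unemp_ts_clean)
```

Because ADF and KPSS test opposite nulls, agreement between them gives a much stronger stationarity verdict than either test alone.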

# Non-seasonal differencing
d_required  <- ndiffs(unemp_ts)
# Seasonal differencing with period 12
D_required <- nsdiffs(unemp_ts)

d_required
## [1] 1
D_required
## [1] 0

Based on these diagnostics, I difference the series d_required times non-seasonally and D_required times seasonally (with period 12).

unemp_diff <- diff(unemp_ts, differences = d_required)

if (D_required > 0) {
  unemp_diff <- diff(unemp_diff, lag = 12, differences = D_required)
}

autoplot(unemp_diff) +
  labs(
    title = "Differenced Unemployment Series (Stationary)",
    x = "Year",
    y = "Differenced UnemployedLF"
  ) +
  theme_minimal()

ACF and PACF of Differenced Series

p_acf  <- ggAcf(unemp_diff)  + ggtitle("ACF of Differenced Series")  + theme_minimal()
p_pacf <- ggPacf(unemp_diff) + ggtitle("PACF of Differenced Series") + theme_minimal()

grid.arrange(p_acf, p_pacf, ncol = 2)

From the ACF and PACF plots, I identify possible low-order ARIMA models (with small p and q) and a seasonal component consistent with yearly (12-month) seasonality. For example, patterns in the ACF and PACF may suggest models such as:

  • ARIMA(1, d, 1)(0, D, 1)[12]
  • ARIMA(2, d, 1)(0, D, 1)[12]
  • ARIMA(1, d, 0)(0, D, 1)[12]

where d = d_required and D = D_required.

Candidate ARIMA Models and Model Selection

To aid selection, I fit a small set of candidate seasonal ARIMA models and compare them using AIC, BIC, and sigma².

# Use auto.arima as a starting point
fit_auto <- auto.arima(unemp_ts, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)

# Construct a few candidate models around the automatic choice
fit_arima1 <- fit_auto
fit_arima2 <- try(Arima(unemp_ts, order = c(1, d_required, 1),
                        seasonal = c(0, D_required, 1)), silent = TRUE)
fit_arima3 <- try(Arima(unemp_ts, order = c(2, d_required, 1),
                        seasonal = c(0, D_required, 1)), silent = TRUE)

candidates <- list(
  Auto   = fit_arima1,
  Model2 = if (inherits(fit_arima2, "try-error")) NULL else fit_arima2,
  Model3 = if (inherits(fit_arima3, "try-error")) NULL else fit_arima3
)

# Remove NULL entries
candidates <- candidates[!sapply(candidates, is.null)]

aic_table <- data.frame(
  Model  = names(candidates),
  AIC    = sapply(candidates, AIC),
  BIC    = sapply(candidates, BIC),
  Sigma2 = sapply(candidates, function(x) x$sigma2)
)

aic_table
##         Model       AIC       BIC      Sigma2
## Auto     Auto -1301.683 -1276.080 0.004867714
## Model2 Model2 -1243.409 -1226.340 0.005466110
## Model3 Model3 -1270.783 -1249.447 0.005177489

The model with the lowest AIC and BIC, and reasonably small sigma², is selected as the final ARIMA model.

best_index <- which.min(aic_table$AIC)
best_name  <- aic_table$Model[best_index]
best_name
## [1] "Auto"
fit_arima_best <- candidates[[best_name]]
fit_arima_best
## Series: unemp_ts 
## ARIMA(1,1,2)(0,0,2)[12] 
## 
## Coefficients:
##          ar1      ma1     ma2     sma1     sma2
##       0.8431  -0.2619  0.3433  -0.0833  -0.1417
## s.e.  0.0280   0.0474  0.0438   0.0439   0.0436
## 
## sigma^2 = 0.004868:  log likelihood = 656.84
## AIC=-1301.68   AICc=-1301.52   BIC=-1276.08

The output above gives the final specification, ARIMA(1,1,2)(0,0,2)[12], together with its estimated coefficients and their standard errors.

Residual Analysis – ARIMA

res_arima <- residuals(fit_arima_best)

p1_ar <- autoplot(res_arima) +
  labs(title = "ARIMA Residuals over Time",
       x = "Year", y = "Residuals") +
  theme_minimal()

p2_ar <- ggplot(data.frame(res_arima), aes(x = res_arima)) +
  geom_histogram(bins = 20) +
  labs(title = "Histogram of ARIMA Residuals",
       x = "Residuals", y = "Count") +
  theme_minimal()

p3_ar <- ggplot(data.frame(fitted = fitted(fit_arima_best), res = res_arima),
                aes(x = fitted, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  labs(title = "Fitted Values vs Residuals (ARIMA)",
       x = "Fitted values", y = "Residuals") +
  theme_minimal()

p4_ar <- ggplot(data.frame(actual = as.numeric(unemp_ts), res = res_arima),
                aes(x = actual, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  labs(title = "Actual Values vs Residuals (ARIMA)",
       x = "Actual values", y = "Residuals") +
  theme_minimal()

grid.arrange(p1_ar, p2_ar, p3_ar, p4_ar, ncol = 2)

ggAcf(res_arima) +
  ggtitle("ACF of ARIMA Residuals") +
  theme_minimal()

If the ARIMA residuals resemble white noise (no obvious pattern in the residual plot and no significant autocorrelations), the selected model is considered adequate. Here the lag-1 residual autocorrelation is about 0.013, versus about 0.83 for Naïve and SES, which supports adequacy.
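A formal whiteness check can supplement the visual diagnostics. The sketch below uses the Ljung-Box test from base R's stats package, with fitdf = 5 for the five estimated ARMA coefficients (ar1, ma1, ma2, sma1, sma2):

```r
# Ljung-Box test on the ARIMA residuals: a large p-value is consistent
# with white noise (no remaining autocorrelation up to lag 24).
Box.test(res_arima, lag = 24, type = "Ljung-Box", fitdf = 5)
```

The forecast package's checkresiduals(fit_arima_best) wraps the same test together with the residual plots shown above.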

Accuracy and Forecast – ARIMA

acc_arima <- accuracy(fit_arima_best)
acc_arima
##                        ME       RMSE        MAE         MPE      MAPE
## Training set -0.002285899 0.06937146 0.05448234 -0.01617532 0.7702565
##                    MASE       ACF1
## Training set 0.05475825 0.01324539

Forecast Next 1 Year

fc_1y <- forecast(fit_arima_best, h = 12)

autoplot(fc_1y) +
  labs(
    title = "ARIMA Forecast – Next 12 Months",
    x = "Year",
    y = "UnemployedLF"
  ) +
  theme_minimal()

data.frame(
  Time     = time(fc_1y$mean),
  Forecast = as.numeric(fc_1y$mean)
)
##        Time Forecast
## 1  2023.000 4.644160
## 2  2023.083 4.642600
## 3  2023.167 4.641284
## 4  2023.250 4.640175
## 5  2023.333 4.639240
## 6  2023.417 4.638452
## 7  2023.500 4.637787
## 8  2023.583 4.637227
## 9  2023.667 4.636754
## 10 2023.750 4.636356
## 11 2023.833 4.636020
## 12 2023.917 4.635737

Forecast Next 2 Years

fc_2y <- forecast(fit_arima_best, h = 24)

autoplot(fc_2y) +
  labs(
    title = "ARIMA Forecast – Next 24 Months",
    x = "Year",
    y = "UnemployedLF"
  ) +
  theme_minimal()

data.frame(
  Time     = time(fc_2y$mean),
  Forecast = as.numeric(fc_2y$mean)
)
##        Time Forecast
## 1  2023.000 4.644160
## 2  2023.083 4.642600
## 3  2023.167 4.641284
## 4  2023.250 4.640175
## 5  2023.333 4.639240
## 6  2023.417 4.638452
## 7  2023.500 4.637787
## 8  2023.583 4.637227
## 9  2023.667 4.636754
## 10 2023.750 4.636356
## 11 2023.833 4.636020
## 12 2023.917 4.635737
## 13 2024.000 4.635498
## 14 2024.083 4.635297
## 15 2024.167 4.635127
## 16 2024.250 4.634984
## 17 2024.333 4.634863
## 18 2024.417 4.634762
## 19 2024.500 4.634676
## 20 2024.583 4.634604
## 21 2024.667 4.634543
## 22 2024.750 4.634491
## 23 2024.833 4.634448
## 24 2024.917 4.634411

The ARIMA forecasts provide a detailed view of expected unemployment over the next one and two years, including prediction intervals.


Accuracy Summary

In this section, I compare the Naïve, SES, Holt-Winters, and ARIMA models based on their accuracy measures.

naive_row <- acc_naive[1, ]
ses_row   <- acc_ses[1, ]
hw_row    <- acc_hw[1, ]
arima_row <- acc_arima[1, ]

acc_table <- rbind(
  Naive = naive_row,
  SES   = ses_row,
  HW    = hw_row,
  ARIMA = arima_row
)

acc_table
##                 ME       RMSE        MAE         MPE      MAPE       MASE
## Naive -0.013662239 0.13538896 0.10132827 -0.21291640 1.3902871 0.10184143
## SES   -0.013637896 0.13527194 0.10114638 -0.21253905 1.3877897 0.10165861
## HW     0.001062708 0.13644219 0.10999571  0.12691261 1.5128298 0.11055275
## ARIMA -0.002285899 0.06937146 0.05448234 -0.01617532 0.7702565 0.05475825
##             ACF1
## Naive 0.82622706
## SES   0.82627142
## HW    0.73757428
## ARIMA 0.01324539

Key observations:

The ARIMA model is clearly the most accurate, with roughly half the RMSE and MAE of the other methods and a residual ACF1 near zero (0.013), indicating it has captured the autocorrelation the others leave behind. Naïve and SES are nearly indistinguishable, as expected given the fitted SES alpha ≈ 1. Holt-Winters does not improve on them here: with such a weak seasonal component, its extra parameters add little, and its MAE and MAPE are in fact slightly higher.

Definitions and Usefulness of Each Forecast Method

  • Naïve: Uses the last observation as the forecast. Easy to compute; useful as a benchmark.
  • Simple Moving Averages: Smooths short-term noise to highlight underlying patterns; useful for visualization and exploratory analysis.
  • Simple Exponential Smoothing (SES): Models a series with a constant level; useful when there is no strong trend or seasonality.
  • Holt-Winters: Extends exponential smoothing to include trend and seasonality; useful for data with regular seasonal patterns.
  • ARIMA (Box–Jenkins): Models autocorrelation structure in the series; can incorporate both non-seasonal and seasonal components; useful for capturing complex dynamics and producing accurate forecasts.

For each accuracy measure (RMSE, MAE, MAPE), the best model is the one with the lowest value and the worst is the one with the highest. In this application, ARIMA is best on every measure, while Holt-Winters, not Naïve, posts the highest MAE and MAPE.
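The per-measure comparison can be automated directly from the accuracy table built above (a small sketch using base R's apply()):

```r
# For each error measure, report which model attains the lowest and
# highest value in acc_table.
measures <- c("RMSE", "MAE", "MAPE")
best  <- apply(acc_table[, measures], 2,
               function(x) rownames(acc_table)[which.min(x)])
worst <- apply(acc_table[, measures], 2,
               function(x) rownames(acc_table)[which.max(x)])
rbind(Best = best, Worst = worst)
```

With the accuracy values reported above, this tabulation names ARIMA as best on all three measures.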


Conclusion

This final exam analysis of New York unemployment (UnemployedLF) leads to several key conclusions:

  • The series is dominated by business-cycle swings (peaks near 11.5 in 1992 and about 10.1 in 2009–2010) with only weak seasonality, and it ends at historically low levels (3.8–4.0) in late 2019.
  • The selected ARIMA(1,1,2)(0,0,2)[12] model fits far better than the smoothing benchmarks, with a training RMSE of 0.069 versus roughly 0.135 for the others.

Based on the forecasts above:

  • The ARIMA point forecasts remain close to 4.6 over both the one-year and two-year horizons, essentially extrapolating the low unemployment observed at the end of 2019.

Ranking the forecasting methods for this time series based on their in-sample performance:

  1. ARIMA (Box–Jenkins) – lowest errors on every measure and residuals close to white noise.
  2. SES – marginally better than Naïve, though with alpha ≈ 1 it is effectively the same model.
  3. Naïve – the simplest benchmark, which this slowly evolving series makes hard to beat.
  4. Holt-Winters – slightly higher MAE and MAPE here because the seasonal component is too weak to help, though it remains the natural choice for strongly seasonal data.

Overall, the ARIMA model provides the most reliable tool for forecasting unemployment in New York, while the simpler methods serve as useful reference points and checks on model complexity.